
    Metadata and provenance management

    Scientists today collect, analyze, and generate terabytes and petabytes of data. These data are often shared with collaborators and further processed and analyzed. To facilitate sharing and interpretation, data need to carry metadata about how they were collected or generated, as well as provenance information about how they were processed. This chapter describes metadata and provenance in the context of the data lifecycle. It also gives an overview of approaches to metadata and provenance management, followed by examples of how applications use metadata and provenance in their scientific processes.
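
    As a rough illustration of what such a record might look like, the sketch below attaches collection metadata and a processing history to a dataset; the field names (instrument, creator, tool, and so on) are illustrative assumptions rather than the chapter's actual schema.

        # A minimal sketch of metadata plus provenance tracking for one dataset.
        # Field names (instrument, creator, tool, ...) are illustrative assumptions.
        from dataclasses import dataclass, field
        from datetime import datetime, timezone
        from typing import List


        @dataclass
        class ProvenanceStep:
            """One processing step applied to the data."""
            tool: str
            parameters: dict
            timestamp: str
            inputs: List[str]        # identifiers of the datasets consumed


        @dataclass
        class Dataset:
            """Data plus the metadata and provenance that travel with it."""
            identifier: str
            metadata: dict           # how the data was collected or generated
            provenance: List[ProvenanceStep] = field(default_factory=list)

            def record_step(self, tool, parameters, inputs):
                """Append one entry to the processing history."""
                self.provenance.append(ProvenanceStep(
                    tool=tool,
                    parameters=parameters,
                    timestamp=datetime.now(timezone.utc).isoformat(),
                    inputs=inputs,
                ))


        # Usage: a collaborator filters a raw observation file and records the step.
        data = Dataset("obs-001", metadata={"instrument": "telescope-A", "creator": "lab-1"})
        data.record_step("noise_filter", {"threshold": 0.05}, inputs=["obs-001-raw"])
        print(data.provenance[0].tool)   # -> noise_filter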

    Performance and design evaluation of the RAID-II storage server

    RAID-II is a high-bandwidth, network-attached storage server designed and implemented at the University of California at Berkeley. In this paper, we measure the performance of RAID-II and evaluate various architectural decisions made during the design process. We first measure the end-to-end performance of the system to be approximately 20 MB/s for both disk array reads and writes. We then perform a bottleneck analysis by examining the performance of each individual subsystem and conclude that the disk subsystem limits performance. By adding a custom interconnect board with a high-speed memory and bus system and parity engine, we are able to achieve a performance speedup of 8 to 15 over a comparative system using only off-the-shelf hardware.
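
    The parity engine referred to above computes redundancy across the blocks of a stripe; the sketch below shows the XOR-parity arithmetic in software, a simplification of what RAID-II's custom board performs in hardware.

        # A minimal sketch of XOR parity over a stripe of data blocks, the
        # computation a RAID parity engine offloads from the host.
        from functools import reduce


        def parity_block(blocks):
            """XOR the blocks of a stripe column by column to form the parity block."""
            return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


        def reconstruct(surviving_blocks, parity):
            """Recover a lost block by XORing the parity with the surviving blocks."""
            return parity_block(surviving_blocks + [parity])


        # Usage: a 3-disk stripe, then recovery of the middle block after a failure.
        stripe = [b"\x0f\x0f", b"\xf0\xf0", b"\xaa\xaa"]
        parity = parity_block(stripe)
        assert reconstruct([stripe[0], stripe[2]], parity) == stripe[1]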

    Challenges for Tertiary Storage in Multimedia Servers

    The low cost per megabyte of optical disk and magnetic tape storage makes these technologies particularly attractive for large-capacity storage servers, including multimedia servers. However, these devices have drawbacks that range from the high cost of many optical drives to the low performance and lack of random access of tape drives. We evaluate the performance of several tertiary storage systems on multimedia applications, including optical disk jukeboxes, arrays of optical drives, and storage hierarchies composed of magnetic disk arrays and magnetic tape libraries. We conclude that striped arrays of next-generation write-once optical disks will offer the best performance at acceptable cost.
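
    The case for striping rests on simple bandwidth arithmetic: aggregate transfer rate grows with the number of drives while each stream's requirement stays fixed. The sketch below uses illustrative drive and stream rates, not figures measured in the paper.

        # Back-of-the-envelope sketch: streams a striped tertiary array can sustain.
        # Drive bandwidth, stream rate, and efficiency are assumed values.
        def concurrent_streams(num_drives, drive_mb_per_s, stream_mb_per_s, efficiency=0.8):
            """Streams supported by a stripe, discounting seek and media-switch overhead."""
            aggregate = num_drives * drive_mb_per_s * efficiency
            return int(aggregate // stream_mb_per_s)


        # One optical drive vs. a 4-wide stripe, each stream needing 0.5 MB/s.
        print(concurrent_streams(1, 2.0, 0.5))   # -> 3
        print(concurrent_streams(4, 2.0, 0.5))   # -> 12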

    Storage Systems for Movies-on-Demand Video Servers

    In this paper, we evaluate storage system alternatives for movies-on-demand video servers. We begin by characterizing the movies-on-demand workload. Then we study disk farms in which one movie is stored per disk. This is a simple scheme, but it wastes substantial disk bandwidth, since disks holding less popular movies are under-utilized; also, good performance requires that movies be replicated to reflect the user request pattern. Next, we examine disk farms in which movies are striped across disks, and find that striped video servers offer close to full utilization of the disks by achieving better load balancing. Finally, we evaluate the use of storage hierarchies for video service that include a tertiary library along with a disk farm. Unfortunately, we show that the performance of neither magnetic tape libraries nor optical disk jukeboxes as part of a storage hierarchy is adequate to service the predicted distribution of movie accesses. We suggest changes to tertiary libraries that ..
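
    The under-utilization argument follows from skewed movie popularity: with one movie per disk, the load on a disk mirrors that movie's popularity, whereas striping spreads every request across all disks. The sketch below assumes a Zipf-like popularity model, not the workload characterized in the paper.

        # Sketch of load balance: one movie per disk vs. striping, under an
        # assumed Zipf-like popularity distribution.
        def zipf_popularity(num_movies, skew=1.0):
            weights = [1.0 / rank ** skew for rank in range(1, num_movies + 1)]
            total = sum(weights)
            return [w / total for w in weights]


        num_movies = 100
        popularity = zipf_popularity(num_movies)

        # One movie per disk: the busiest disk carries the most popular movie's share.
        # Striping: every request is spread evenly over all disks.
        print(f"busiest disk, one movie per disk: {max(popularity):.1%} of requests")
        print(f"busiest disk, striped:            {1.0 / num_movies:.1%} of requests")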

    Performance Measurements of the First RAID Prototype

    This paper examines the performance of RAID the First, a prototype disk array built by the RAID group at U.C. Berkeley. A hierarchy of bottlenecks that limit overall performance was discovered in the system. The most serious is memory system contention on the Sun4/280 host CPU, which limits array bandwidth to 2.3 MBytes/sec. The array performs more successfully on small random operations, achieving nearly 300 I/Os per second before the Sun4/280 becomes CPU-limited. Other bottlenecks in the system are the VME backplane, bandwidth on the disk controller, and overheads associated with the SCSI protocol. All are examined in detail. The main conclusion of this report is that to achieve the potential bandwidth of arrays, more powerful CPUs alone will not suffice. Just as important are adequate host memory bandwidth and support for high bandwidth on disk controllers. Current disk controllers are more often designed to achieve large numbers of small random operations, rather than..
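
    The bottleneck hierarchy amounts to the observation that end-to-end bandwidth is capped by the slowest subsystem on the data path. In the sketch below, only the 2.3 MBytes/sec host memory figure comes from the report; the other subsystem numbers are placeholders.

        # End-to-end array bandwidth is capped by the slowest subsystem on the path.
        # Only the 2.3 MBytes/sec host memory figure is from the report; the rest
        # are illustrative placeholders.
        subsystems_mb_per_s = {
            "Sun4/280 host memory system": 2.3,
            "VME backplane": 8.0,            # assumed
            "disk controller": 6.0,          # assumed
            "disk array (aggregate)": 10.0,  # assumed
        }

        bottleneck = min(subsystems_mb_per_s, key=subsystems_mb_per_s.get)
        print(f"end-to-end bandwidth ~ {subsystems_mb_per_s[bottleneck]} MBytes/sec, "
              f"limited by the {bottleneck}")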

    Scheduling data-intensive workflows on storage constrained resources

    Data-intensive workflows stage large amounts of data in and out of compute resources. The data staging strategies employed during the execution of such workflows can have a significant impact on the time taken to complete the execution or on the overall cost of the execution. We describe the problem of minimizing the overall time taken for execution and present a heuristic based on ordering clean-up jobs in the workflow. Next, we develop genetic algorithm based approaches to solving the same problem and demonstrate that the results obtained with the heuristic are comparable to the best results obtained with the genetic algorithm based approaches. We also describe the problem of minimizing the overall cost of execution and extend our genetic algorithm to generate schedules that vary the number of processors and the amount of storage provisioned for execution to generate low-cost schedules.
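
    A sketch of the kind of schedule evaluation such a heuristic or genetic algorithm might optimize is shown below; the job model, costs, and storage penalty are illustrative assumptions, not the paper's formulation. It also shows why ordering clean-up jobs matters: running clean-up early keeps the peak staged-data footprint within the provisioned storage.

        # Illustrative evaluation of a workflow schedule: total runtime plus a
        # penalty when staged data exceeds the provisioned storage. The job model
        # and penalty weight are assumptions, not the paper's formulation.
        def evaluate(order, jobs, storage_limit):
            time, in_use, peak = 0.0, 0.0, 0.0
            for name in order:
                runtime, staged_in, freed = jobs[name]
                in_use += staged_in          # stage inputs before the job runs
                peak = max(peak, in_use)
                time += runtime
                in_use -= freed              # clean-up jobs release staged data
            return time + 1000.0 * max(0.0, peak - storage_limit)


        # Toy workflow: (runtime in hours, data staged in GB, data freed in GB).
        jobs = {
            "stage_a": (1.0, 40.0, 0.0), "proc_a": (5.0, 0.0, 0.0), "clean_a": (0.5, 0.0, 40.0),
            "stage_b": (1.0, 50.0, 0.0), "proc_b": (6.0, 0.0, 0.0), "clean_b": (0.5, 0.0, 50.0),
        }

        # Deferring clean-up overflows a 60 GB allocation; running it early does not.
        late  = ["stage_a", "proc_a", "stage_b", "proc_b", "clean_a", "clean_b"]
        early = ["stage_a", "proc_a", "clean_a", "stage_b", "proc_b", "clean_b"]
        print(evaluate(late, jobs, storage_limit=60.0))    # large, penalized score
        print(evaluate(early, jobs, storage_limit=60.0))   # 14.0, within storage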